    Teaching a New Dog Old Tricks: Resurrecting Multilingual Retrieval Using Zero-shot Learning

    While billions of non-English-speaking users rely on search engines every day, the problem of ad-hoc information retrieval is rarely studied for non-English languages. This is primarily due to a lack of data sets that are suitable to train ranking algorithms. In this paper, we tackle the lack of data by leveraging pre-trained multilingual language models to transfer a retrieval system trained on English collections to non-English queries and documents. Our model is evaluated in a zero-shot setting, meaning that we use it to predict relevance scores for query-document pairs in languages never seen during training. Our results show that the proposed approach can significantly outperform unsupervised retrieval techniques for Arabic, Chinese Mandarin, and Spanish. We also show that augmenting the English training collection with some examples from the target language can sometimes improve performance. Comment: ECIR 2020 (short

    Query Expansion for Survey Question Retrieval in the Social Sciences

    In recent years, the importance of research data and the need to archive and share it in the scientific community have increased enormously. This introduces a whole new set of challenges for digital libraries. In the social sciences, typical research data sets consist of surveys and questionnaires. In this paper we focus on the use case of social science survey question reuse and on mechanisms to support users in formulating queries for data sets. We describe and evaluate thesaurus- and co-occurrence-based approaches for query expansion to improve retrieval quality in digital libraries and research data archives. The challenge here is to translate the information need and the underlying sociological phenomena into proper queries. We show that retrieval quality can be improved by adding related terms to the queries. In a direct comparison, queries automatically expanded with extracted co-occurring terms can provide better results than queries manually reformulated by a domain expert and better results than a keyword-based BM25 baseline. Comment: to appear in Proceedings of 19th International Conference on Theory and Practice of Digital Libraries 2015 (TPDL 2015
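
As a rough illustration of co-occurrence-based query expansion, the sketch below counts which terms appear in the same item as a query term and appends the most frequent ones to the query. The abstract does not specify how co-occurring terms are extracted, so the item-level counting, the small stopword list, the function name `cooccurrence_expand`, and the toy survey questions are all assumptions made for this example.

```python
from collections import Counter

# Hypothetical stopword list, just large enough for the toy data below.
STOPWORDS = {"in", "and", "the", "with"}

def cooccurrence_expand(query_terms, documents, k=2):
    """Append the k terms that most often share a document with a query term."""
    query = set(query_terms)
    counts = Counter()
    for doc in documents:
        tokens = set(doc.lower().split()) - STOPWORDS
        if tokens & query:
            # Count every non-query term once per matching document.
            counts.update(tokens - query)
    # Highest count first; ties broken alphabetically so the result is deterministic.
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return list(query_terms) + [term for term, _ in ranked[:k]]

# Toy survey questions (invented for illustration).
questions = [
    "trust in parliament and government",
    "trust in the police",
    "satisfaction with government services",
    "trust in government institutions",
]
expanded = cooccurrence_expand(["trust"], questions, k=2)
print(expanded)
```

A thesaurus-based variant would replace the counting step with a lookup of related terms in a controlled vocabulary.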

    Improving ranking for systematic reviews using query adaptation

    Identifying relevant studies for inclusion in systematic reviews requires significant effort from human experts who manually screen large numbers of studies. The problem is made more difficult by the growing volume of medical literature, and Information Retrieval techniques have proved useful for reducing this workload. Reviewers are often interested in particular types of evidence such as Diagnostic Test Accuracy studies. This paper explores the use of query adaptation to identify particular types of evidence and thereby reduce the workload placed on reviewers. A simple retrieval system that ranks studies using TF.IDF weighted cosine similarity was implemented. The Log-Likelihood, Chi-Squared and Odds-Ratio lexical statistics and relevance feedback were used to generate sets of terms that indicate evidence relevant to Diagnostic Test Accuracy reviews. Experiments using a set of 80 systematic reviews from the CLEF2017 and CLEF2018 eHealth tasks demonstrate that the approach improves retrieval performance.
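
A minimal sketch of ranking by TF.IDF weighted cosine similarity follows. It is not the system described in the abstract: the IDF variant (`log(N/df) + 1`), the whitespace tokenizer, and the toy study titles are assumptions chosen to keep the example self-contained.

```python
import math
from collections import Counter

def tfidf_vectors(texts):
    """Return one sparse TF-IDF vector (dict) per text, plus the IDF table."""
    tokenized = [text.lower().split() for text in texts]
    n_docs = len(tokenized)
    doc_freq = Counter(term for tokens in tokenized for term in set(tokens))
    # Assumed IDF variant: log(N / df) + 1, so no in-collection term gets weight zero.
    idf = {term: math.log(n_docs / df) + 1.0 for term, df in doc_freq.items()}
    vectors = [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]
    return vectors, idf

def cosine(a, b):
    """Cosine similarity between two sparse vectors stored as dicts."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def rank(query, studies):
    """Return study indices ordered from most to least similar to the query."""
    vectors, idf = tfidf_vectors(studies)
    # Query terms unseen in the collection receive weight zero.
    q_vec = {t: tf * idf.get(t, 0.0) for t, tf in Counter(query.lower().split()).items()}
    scores = [cosine(q_vec, vec) for vec in vectors]
    return sorted(range(len(studies)), key=lambda i: -scores[i])

# Toy collection (invented titles).
studies = [
    "sensitivity and specificity of ultrasound for appendicitis",
    "randomised trial of antibiotics for appendicitis",
    "cost of hospital care",
]
ranking = rank("diagnostic test accuracy sensitivity specificity", studies)
print(ranking)
```

Query adaptation in the sense of the abstract would then amount to adding terms that indicate the target evidence type to the query string before calling `rank`.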

    Probabilistic models of information retrieval based on measuring the divergence from randomness

    We introduce and create a framework for deriving probabilistic models of Information Retrieval. The models are nonparametric models of IR obtained in the language model approach. We derive term-weighting models by measuring the divergence of the actual term distribution from that obtained under a random process. Among the random processes we study the binomial distribution and Bose-Einstein statistics. We define two types of term frequency normalization for tuning term weights in the document-query matching process. The first normalization assumes that documents have the same length and measures the information gain with the observed term once it has been accepted as a good descriptor of the observed document. The second normalization is related to the document length and to other statistics. These two normalization methods are applied to the basic models in succession to obtain weighting formulae. Results show that our framework produces different nonparametric models forming baseline alternatives to the standard tf-idf model.
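
To make the idea concrete, here is a hedged sketch of one way such a weight could be assembled: a binomial random process gives the information content of seeing a term `tf` times, a length normalization rescales `tf`, and a Laplace-style gain factor `1 / (tfn + 1)` discounts the result. The abstract gives no formulae, so these specific choices, the function names, and the numbers are assumptions for illustration only.

```python
import math

def log2_binomial_pmf(total_occurrences, tf, p):
    """log2 of the binomial probability of tf successes in total_occurrences trials.

    Gamma functions are used so that tf may be a non-integer normalized frequency.
    """
    log_comb = (
        math.lgamma(total_occurrences + 1)
        - math.lgamma(tf + 1)
        - math.lgamma(total_occurrences - tf + 1)
    )
    log_prob = log_comb + tf * math.log(p) + (total_occurrences - tf) * math.log(1 - p)
    return log_prob / math.log(2)

def dfr_weight(tf, doc_len, avg_doc_len, total_occurrences, n_docs):
    """Term weight as (gain) x (information content under randomness)."""
    # Assumed length normalization: shrink tf for longer-than-average documents.
    tfn = tf * math.log2(1 + avg_doc_len / doc_len)
    # Random process: each occurrence lands in a given document with p = 1 / N.
    inf1 = -log2_binomial_pmf(total_occurrences, tfn, 1.0 / n_docs)
    # Assumed Laplace-style information gain once the term is accepted as a descriptor.
    gain = 1.0 / (tfn + 1.0)
    return gain * inf1

# Hypothetical collection statistics: term occurs 100 times in 1000 documents.
w_once = dfr_weight(tf=1, doc_len=200, avg_doc_len=200, total_occurrences=100, n_docs=1000)
w_thrice = dfr_weight(tf=3, doc_len=200, avg_doc_len=200, total_occurrences=100, n_docs=1000)
w_long_doc = dfr_weight(tf=3, doc_len=600, avg_doc_len=200, total_occurrences=100, n_docs=1000)
print(w_once, w_thrice, w_long_doc)
```

With these inputs, more occurrences in an average-length document raise the weight, while the same count in a document three times longer lowers it.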

    Querying a Bioinformatic Data Sources Registry with Concept Lattices

    ISSN 0302-9743 (Print), 1611-3349 (Online); ISBN 978-3-540-27783-5. Bioinformatic data sources available on the web are multiple and heterogeneous. The lack of documentation and the difficulty of interacting with these data banks require user competence in both informatics and biology for optimal use of the sources' contents, which remain rather underexploited. In this paper we present an approach based on formal concept analysis to classify and search relevant bioinformatic data sources for a given user query. It consists of building the concept lattice from the binary relation between bioinformatic data sources and their associated metadata. The concept built from a given user query is then merged into the concept lattice. The result is obtained by extracting the set of sources belonging to the extents of the query concept's subsumers in the resulting concept lattice. The ranking of sources is given by the concept specificity order in the concept lattice. An improvement of the approach consists of automatic refinement of the query using domain ontologies. Two forms of refinement are possible: by generalisation and by specialisation.
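
The retrieval step can be approximated without building the full lattice. In the sketch below, the formal context maps each source to its metadata attributes, `extent` returns the sources carrying every attribute in a set, and `rank_sources` orders sources by how many query attributes they share, a simplified stand-in for the specificity order of the query concept's subsumers. The source names, attributes, and both function names are hypothetical.

```python
def extent(attributes, context):
    """Sources whose metadata contain every attribute in the given set."""
    return {source for source, meta in context.items() if attributes <= meta}

def rank_sources(query, context):
    """Order sources by the number of query attributes they share.

    Sharing more query attributes corresponds to a more specific subsumer
    of the query concept. Sources sharing nothing are left out.
    """
    shared = {source: query & meta for source, meta in context.items()}
    matching = [source for source, attrs in shared.items() if attrs]
    # Most shared attributes first; ties broken by name for determinism.
    return sorted(matching, key=lambda s: (-len(shared[s]), s))

# Toy formal context: source -> metadata attributes (all invented).
context = {
    "SourceA": {"protein", "sequence", "human"},
    "SourceB": {"protein", "structure"},
    "SourceC": {"gene", "sequence", "human"},
    "SourceD": {"pathway"},
}
query = {"protein", "sequence"}

exact_matches = extent(query, context)
ranking = rank_sources(query, context)
print(exact_matches, ranking)
```

Refinement by generalisation or specialisation would correspond to replacing query attributes with broader or narrower ontology terms before ranking.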

    Intrasession and Between-Visit Variability of Sector Peripapillary Angioflow Vessel Density Values Measured with the Angiovue Optical Coherence Tomograph in Different Retinal Layers in Ocular Hypertension and Glaucoma

    PURPOSE: To evaluate intrasession and between-visit reproducibility of sector peripapillary angioflow vessel-density (PAFD, %) values in the optic nerve head (ONH) and radial peripapillary capillaries (RPC) layers, respectively, and to analyze the influence of the corresponding sector retinal nerve fiber layer thickness (RNFLT) on the results. METHODS: High quality images acquired with the Angiovue/RTVue-XR Avanti optical coherence tomograph (Optovue Inc., Fremont, USA) on 1 eye of 18 stable glaucoma and ocular hypertension patients were analyzed using the Optovue 2015.100.0.33 software version. Three images were acquired in one visit and 1 image 3 months later. RESULTS: PAFD image quality for all images necessary to calculate reproducibility was sufficient for analysis in only 18 of the 83 participants (21.7%) who were successfully imaged for RNFLT. Intrasession coefficient of variation (CV) ranged between 2.30 and 3.89%, and 3.51 and 5.12% for the peripapillary sectors in the ONH and RPC layers, respectively. The corresponding between-visit CV values ranged between 3.05 and 4.26%, and 4.99 and 6.90%, respectively. Intrasession SD did not correlate with the corresponding RNFLT in any sector in either layer (P ≥ 0.170). In the ONH layer, sector PAFD values did not correlate with the corresponding RNFLT values (P ≥ 0.100). In contrast, in the RPC layer a significant positive correlation between the corresponding sector PAFD and RNFLT values was found for all but one peripapillary sector (Pearson-r range: 0.652 to 0.771, P ≤ 0.0046). CONCLUSION: Though in several patients routine use of PAFD measurement may be limited by suboptimal image quality, in the successfully imaged cases (21.7% of the study eyes in the current investigation) reproducibility of sector PAFD values seems to be sufficient for clinical research. In stable patients intrasession variability explains most of the between-visit variability. Sector PAFD variability is independent of sector RNFLT, a marker of glaucoma severity. In the RPC layer, sector PAFD and RNFLT show strong to very strong positive correlation.
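
For readers unfamiliar with the statistic, the coefficient of variation is the standard deviation expressed as a percentage of the mean. The sketch below computes it for three repeated measurements of one sector. The abstract does not state which estimator the study used, so the sample standard deviation and the measurement values here are assumptions.

```python
import statistics

def coefficient_of_variation(values):
    """Sample standard deviation as a percentage of the mean."""
    mean = statistics.mean(values)
    sd = statistics.stdev(values)  # sample SD, n - 1 in the denominator
    return 100.0 * sd / mean

# Hypothetical vessel-density readings (%) from three images taken in one session.
intrasession = [50.0, 52.0, 51.0]
cv = coefficient_of_variation(intrasession)
print(f"Intrasession CV: {cv:.2f}%")
```

A between-visit CV would be computed the same way from measurements taken at different visits.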